[A quick reminder regarding the final objectives]

This paper brings to light how network graph embeddings can help to solve major open problems in natural language understanding. One illustrative problem in NLP is language variability - also called the ambiguity problem - which occurs when two sentences express the same meaning (or ideas) with very different words. For instance, we may say almost interchangeably: “where is the nearest sushi restaurant?” or “can you please give me addresses of sushi places nearby?”. These two sentences share exactly the same meaning despite very different wording. This is the big challenge we are struggling with. In data science terms, it is a well-known problem called text similarity. Indeed, the sparse vectors for the two sentences share almost no words (only “sushi”) and will consequently have a cosine distance close to 1. This is a terrible distance score, because the two sentences have very similar meanings.
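To make this concrete, here is a minimal sketch (written in Python for illustration; the whitespace tokenizer and raw bag-of-words counts are simplifications) showing how the sentence pair above scores under plain cosine distance:

```python
import math
from collections import Counter

def cosine_distance(a: str, b: str) -> float:
    """Cosine distance (1 - cosine similarity) between bag-of-words count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)          # overlap between the two vectors
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return 1.0 - dot / (na * nb)

s1 = "where is the nearest sushi restaurant"
s2 = "can you please give me addresses of sushi places nearby"
print(round(cosine_distance(s1, s2), 3))  # close to 1, despite near-identical meaning
```

The single shared token (“sushi”) keeps the distance just below 1, but the score is still far too high for two sentences that mean the same thing.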

The first thing that crosses any data scientist’s mind would be to use popular document embedding methods based on similarity measures, such as Doc2Vec, averaged w2v vectors, weighted-average w2v vectors (e.g. tf-idf weighted) or RNN-based embeddings (e.g. deep LSTM networks), to cope with this text similarity challenge.

As for us, we will tackle this text similarity challenge by implementing network graph embeddings in combination with traditional word embedding techniques.

[In this part]

This part aims at implementing a neo4j graph database. We currently have two pieces of information:

  • We have news articles, written in English and covering many topics. This is the text itself: the micro level. For our final objective, this is the most valuable piece of information.

  • In addition, we have information regarding the context of each news article: when it was published, where it comes from, which category it falls into, who wrote it, … This is the macro level.

The objective is to play with these two levels in order to classify news as well as possible.

Creation of two graph databases

News within their context environment

As explained in the previous part, we want to pool all the news collected into a single graph database. The underlying goal is to surface hidden links in order to answer the following set of questions:

  • Is there a specific link between two news articles coming from the same category [sport, business, technology, …] and published on the same date, but coming from different sources? If so, can we assess that the two articles under consideration are the same, i.e. do they overlap?

  • Is there a specific link between two news articles published at the same hour but falling into two different categories? Let’s say a famous politician dies (and he was also a tennis enthusiast). This piece of information should then appear both in the general category (for the political part) and in the sport category (for the tennis part).

  • There are plenty of other possibilities.
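Once the news and their context are stored in neo4j, questions like these translate naturally into Cypher queries. As a sketch of the first question only - the node labels (News, Category, Date, Source) and relationship types (IN_CATEGORY, PUBLISHED_ON, FROM_SOURCE) below are assumptions for illustration, not the actual schema - it could look like:

```cypher
// Pairs of news sharing a category and a publication date
// but coming from two different sources.
MATCH (n1:News)-[:IN_CATEGORY]->(c:Category)<-[:IN_CATEGORY]-(n2:News),
      (n1)-[:PUBLISHED_ON]->(d:Date)<-[:PUBLISHED_ON]-(n2),
      (n1)-[:FROM_SOURCE]->(s1:Source),
      (n2)-[:FROM_SOURCE]->(s2:Source)
WHERE id(n1) < id(n2)   // avoid returning each pair twice
  AND s1 <> s2
RETURN n1.title, n2.title, c.name, d.value
LIMIT 20;
```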

Just to give you a first hint of what we are pursuing, let’s consider an example. To do so, we drew a sample of news articles (to finally get pairs of news) and put them into perspective according to their own features.

visNetwork(final_nodes, final_edges)

As we can see, news articles (in green) are linked together because they share some information, such as a category (in red), a date (in blue) or a source (in yellow).

Let’s have a look at the news plotted just above to see if we can find concrete and relevant links.

kable(news_data[c(1:10),c(2,5,8)]) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
description | title | category
In the month following Donald Trump’s inauguration it’s clear that Russians are no longer jumping down the aisles. | Has Russia changed its tone towards Donald Trump? | general
A fasting diet could reverse diabetes and repair the pancreas, US researchers discover. | Fasting diet ‘could reverse diabetes and regenerate pancreas’ | general
Researchers discover what could be one of the worst cases of mine pollution in the world in the heart of New South Wales’ pristine heritage-listed Blue Mountains. | Mine pollution turning Blue Mountains river into ‘waste disposal’ | general
Yemen is now classified as the world’s worst humanitarian disaster but Australia has committed no funding to help save lives there. | Australia ignores unfolding humanitarian catastrophe in Yemen | general
Malcolm Turnbull and Joko Widodo hold talks in Sydney, reviving cooperation halted after the discovery of insulting posters at a military base, and reaching deals on trade and a new consulate in east Java. | Australia and Indonesia agree to fully restore military ties | general
If this is how BlackBerry wants to do hardware, we really won’t miss them. | BlackBerry KeyOne Hands On—BlackBerry wants $549 for mid-range device | technology
States that legalized gay marriage early created a natural experiment. | Same-sex marriage linked to decline in teen suicides | technology
We may finally be getting somewhere in our fight against the disease. | New malaria vaccine is fully effective in very small clinical trial | technology
Latest GTX 1060 laptop is more portable than its big ol’ butt might suggest. | Alienware 13 R3: Powerful and pretty, if you don’t mind junk in the trunk | technology
But don’t try this at home—the results are mainly in mice and need verifying. | A fasting-diet may trigger regeneration of a diabetic pancreas | technology

The first answer that crosses my mind is YES, but the picture is pretty messy and not so obvious, simply because we drew a random sample. If we want to find relevant and striking examples, we must work on the whole dataset.

From now on, we are getting down to business by considering all the news and not just a sample. In terms of visualization, I obviously cannot display a graph network of 50K+ news articles at once … but with neo4j and visNetwork.js we can. Here is the amazing result, reported as a .gif. Let’s have a close look at it:
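For reference, pooling the full dataset into neo4j can be sketched with a LOAD CSV statement. The file name, column names and graph schema below are illustrative assumptions, consistent with a simple news/context model:

```cypher
// Hypothetical bulk import: one News node per row, linked to its
// Category, Date and Source context nodes (MERGE avoids duplicates).
LOAD CSV WITH HEADERS FROM 'file:///news.csv' AS row
MERGE (n:News {title: row.title})
MERGE (c:Category {name: row.category})
MERGE (d:Date {value: row.date})
MERGE (s:Source {name: row.source})
MERGE (n)-[:IN_CATEGORY]->(c)
MERGE (n)-[:PUBLISHED_ON]->(d)
MERGE (n)-[:FROM_SOURCE]->(s);
```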

Words graph